Maat: Performance Metric Anomaly Anticipation for Cloud Services with Conditional Diffusion
Ensuring the reliability and user satisfaction of cloud services necessitates
prompt anomaly detection followed by diagnosis.
Existing techniques for anomaly detection focus solely on real-time
detection, meaning that anomaly alerts are issued as soon as anomalies occur.
However, anomalies can propagate and escalate into failures, making
faster-than-real-time anomaly detection highly desirable for expediting
downstream analysis and intervention.
This paper proposes Maat, the first work to address anomaly anticipation of
performance metrics in cloud services.
Maat adopts a novel two-stage paradigm for anomaly anticipation, consisting
of metric forecasting and anomaly detection on forecasts.
The metric forecasting stage employs a conditional denoising diffusion model
to enable multi-step forecasting in an auto-regressive manner.
The detection stage extracts anomaly-indicating features based on domain
knowledge and applies isolation forest with incremental learning to detect
upcoming anomalies.
Thus, our method can uncover anomalies that better conform to human
expertise.
Evaluation on three publicly available datasets demonstrates that Maat can
anticipate anomalies faster than real time while performing comparably to or
more effectively than state-of-the-art real-time anomaly detectors.
We also present cases highlighting Maat's success in forecasting abnormal
metrics and discovering anomalies.
Comment: This paper has been accepted by the Research track of the 38th
IEEE/ACM International Conference on Automated Software Engineering (ASE 2023).
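The two-stage paradigm described above, forecasting first and then detecting on the forecasts, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the conditional diffusion model is replaced by a placeholder auto-regressive extrapolator, and the hand-crafted features are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def forecast_ar(history, steps=5):
    # Placeholder forecaster: linear extrapolation applied auto-regressively.
    # (Maat itself uses a conditional denoising diffusion model here.)
    h = list(history)
    preds = []
    for _ in range(steps):
        nxt = 2 * h[-1] - h[-2]  # each prediction is fed back as input
        preds.append(nxt)
        h.append(nxt)
    return np.array(preds)

def extract_features(window):
    # Anomaly-indicating summary features (illustrative, not the paper's set).
    return [window.mean(), window.std(), window.max() - window.min()]

# Stage 2 detector: isolation forest trained on features of normal windows.
rng = np.random.default_rng(0)
normal = [rng.normal(1.0, 0.05, 20) for _ in range(200)]
det = IsolationForest(random_state=0).fit([extract_features(w) for w in normal])

history = rng.normal(1.0, 0.05, 20)
future = forecast_ar(history)                        # stage 1: multi-step forecast
label = det.predict([extract_features(future)])[0]   # stage 2: detect on forecast
```

Because detection runs on forecasts rather than observations, an alert can precede the anomaly itself, which is the "faster than real time" property the abstract claims.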
Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention
Prompt and accurate detection of system anomalies is essential to ensure the
reliability of software systems. Unlike manual efforts that exploit all
available run-time information, existing approaches usually leverage only a
single type of monitoring data (often logs or metrics) or fail to make
effective use of the joint information among different types of data.
Consequently, many false predictions occur. To better understand the
manifestations of system anomalies, we conduct a systematic study on a large
amount of heterogeneous data, i.e., logs and metrics. Our study demonstrates
that logs and metrics can manifest system anomalies collaboratively and
complementarily, and that neither alone is sufficient. Thus, integrating
heterogeneous data can help recover the complete picture of a system's health
status. In this context, we propose Hades, the first end-to-end semi-supervised
approach to effectively identify system anomalies based on heterogeneous data.
Our approach employs a hierarchical architecture to learn a global
representation of the system status by fusing log semantics and metric
patterns. It captures discriminative features and meaningful interactions from
heterogeneous data via a cross-modal attention module, trained in a
semi-supervised manner. We evaluate Hades extensively on large-scale simulated
data and datasets from Huawei Cloud. The experimental results present the
effectiveness of our model in detecting system anomalies. We also release the
code and the annotated dataset for replication and future research.
Comment: In Proceedings of the 2023 IEEE/ACM 45th International Conference on
Software Engineering (ICSE). arXiv admin note: substantial text overlap with
arXiv:2207.0291
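The cross-modal attention idea at the core of Hades, letting one modality attend to the other so that joint information is exploited rather than a single data type, can be sketched as plain scaled dot-product attention between log and metric embeddings. All shapes and names here are hypothetical; the paper's module is a trained neural layer, not this NumPy sketch.

```python
import numpy as np

def cross_modal_attention(log_emb, metric_emb):
    """Log embeddings (queries) attend to metric embeddings (keys/values).
    log_emb: (n, d) log-event vectors; metric_emb: (m, d) metric-window vectors.
    Returns a (n, d) metric context for each log event."""
    d = log_emb.shape[-1]
    scores = log_emb @ metric_emb.T / np.sqrt(d)      # (n, m) similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ metric_emb                       # fused representation

rng = np.random.default_rng(0)
logs, metrics = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
fused = cross_modal_attention(logs, metrics)          # (4, 8)
```

The fused output is what a downstream classifier would consume to produce the global representation of system status the abstract describes.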
Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection
Performance issues permeate large-scale cloud service systems, which can lead
to huge revenue losses. To ensure reliable performance, it is essential to
accurately identify and localize these issues using service monitoring metrics.
Given the complexity and scale of modern cloud systems, this task can be
challenging and may require extensive expertise and resources beyond the
capacity of individual humans. Some existing methods tackle this problem by
analyzing each metric independently to detect anomalies. However, this could
incur overwhelming alert storms that are difficult for engineers to diagnose
manually. To pursue better performance, not only the temporal patterns of
metrics but also the correlation between metrics (i.e., relational patterns)
should be considered, which can be formulated as a multivariate metrics anomaly
detection problem. However, most of the studies fall short of extracting these
two types of features explicitly. Moreover, there exist some unlabeled
anomalies mixed in the training data, which may hinder the detection
performance. To address these limitations, we propose the Relational-Temporal
Anomaly Detection Model (RTAnomaly) that combines the relational and temporal
information of metrics. RTAnomaly employs a graph attention layer to learn the
dependencies among metrics, which will further help pinpoint the anomalous
metrics that may cause the anomaly effectively. In addition, we exploit the
concept of positive unlabeled learning to address the issue of potential
anomalies in the training data. To evaluate our method, we conduct experiments
on a public dataset and two industrial datasets. RTAnomaly outperforms all the
baseline models, achieving an average F1 score of 0.929 and Hit@3 of 0.920,
demonstrating its superiority.
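The graph attention layer that RTAnomaly uses to learn dependencies among metrics can be sketched as a single GAT-style head: each metric node attends over its neighbors in a (learned or given) dependency graph. This is a hand-rolled NumPy illustration with hypothetical shapes, not the paper's trained layer.

```python
import numpy as np

def graph_attention(x, adj, w, a):
    """One single-head, GAT-style attention pass over metric nodes.
    x: (n, f) per-metric features; adj: (n, n) 0/1 adjacency (with self-loops);
    w: (f, h) projection; a: (2h,) attention vector."""
    h = x @ w                                          # project node features
    n, hd = h.shape
    # Attention logits e_ij = a . [h_i || h_j], LeakyReLU, masked by adjacency.
    e = np.array([[a[:hd] @ h[i] + a[hd:] @ h[j] for j in range(n)]
                  for i in range(n)])
    e = np.where(e > 0, e, 0.2 * e)                    # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                     # only attend to neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)          # softmax over neighbors
    return alpha @ h                                   # aggregated features

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
adj = np.eye(5) + np.eye(5, k=1) + np.eye(5, k=-1)    # chain graph + self-loops
out = graph_attention(x, adj, w=rng.normal(size=(4, 8)), a=rng.normal(size=(16,)))
```

Inspecting the attention weights of such a layer is also what makes the anomalous-metric localization (Hit@3) interpretable: metrics that receive unusual attention mass are candidate culprits.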
A Large-scale Benchmark for Log Parsing
Log data is pivotal in activities like anomaly detection and failure
diagnosis in the automated maintenance of software systems. Due to their
unstructured format, log parsing is often required to transform them into a
structured format for automated analysis. A variety of log parsers exist,
making it vital to benchmark these tools to comprehend their features and
performance. However, existing datasets for log parsing are limited in terms of
scale and representativeness, posing challenges for studies that aim to
evaluate or develop log parsers. This problem becomes more pronounced when
these parsers are evaluated for production use. To address these issues, we
introduce a new collection of large-scale annotated log datasets, named LogPub,
which more accurately mirrors log data observed in real-world software systems.
LogPub comprises 14 datasets, each averaging 3.6 million log lines. Utilizing
LogPub, we re-evaluate 15 log parsers in a more rigorous and practical setting.
We also propose a new evaluation metric to lessen the sensitivity of current
metrics to imbalanced data distribution. Furthermore, we are the first to
scrutinize the detailed performance of log parsers on logs that represent rare
system events and offer comprehensive information for system troubleshooting.
Parsing such logs accurately is vital yet challenging. We believe that our work
could shed light on the design and evaluation of log parsers in more realistic
settings, thereby facilitating their implementation in production systems.
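The core task that LogPub benchmarks, recovering a structured template from an unstructured log line, can be illustrated with a naive regex-based parser. Real parsers evaluated in such benchmarks (e.g. Drain) build parse trees and handle far more variable types; this sketch only shows the idea of masking variable tokens.

```python
import re

def parse_log(line):
    """Naive log parser: mask variable tokens to recover a template.
    The regexes below are illustrative; production parsers are far richer."""
    template = re.sub(r'0x[0-9a-fA-F]+', '<*>', line)        # hex addresses
    template = re.sub(r'\b\d+(\.\d+)*\b', '<*>', template)   # ints, IPs, versions
    return template

print(parse_log("Connection from 10.0.0.5 port 8080"))
# -> Connection from <*> port <*>
```

Lines that represent rare system events, which the abstract singles out as vital yet challenging, are exactly where such frequency-agnostic template extraction tends to break down, motivating the proposed evaluation metric for imbalanced data.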
Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems
Ensuring the reliability of cloud systems is critical for both cloud vendors
and customers. Cloud systems often rely on virtualization techniques to create
instances of hardware resources, such as virtual machines. However,
virtualization hinders the observability of cloud systems, making it
challenging to diagnose platform-level issues. To improve system observability,
we propose to infer functional clusters of instances, i.e., groups of instances
having similar functionalities. We first conduct a pilot study on a large-scale
cloud system, i.e., Huawei Cloud, demonstrating that instances having similar
functionalities share similar communication and resource usage patterns.
Motivated by these findings, we formulate the identification of functional
clusters as a clustering problem and propose a non-intrusive solution called
Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions
instances into coarse-grained chunks based on communication patterns. Within
each chunk, Prism further groups instances with similar resource usage patterns
to produce fine-grained functional clusters. Such a design reduces noises in
the data and allows Prism to process massive instances efficiently. We evaluate
Prism on two datasets collected from the real-world production environment of
Huawei Cloud. Our experiments show that Prism achieves a v-measure of ~0.95,
surpassing existing state-of-the-art solutions. Additionally, we illustrate the
integration of Prism within monitoring systems for enhanced cloud reliability
through two real-world use cases.
Comment: The paper was accepted by the 38th IEEE/ACM International Conference
on Automated Software Engineering (ASE 2023).
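Prism's coarse-to-fine strategy can be sketched in two lines of logic: chunk instances by communication-graph connectivity, then cluster each chunk by resource-usage patterns. The choice of connected components and k-means below is a simplification for illustration; the paper's actual algorithms may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import KMeans

def coarse_to_fine(comm_adj, usage, k=2):
    """Coarse stage: partition instances into chunks via the communication graph.
    Fine stage: within each chunk, group instances by resource-usage patterns.
    Returns a (chunk_id, cluster_id) label per instance."""
    n_chunks, chunk_ids = connected_components(csr_matrix(comm_adj), directed=False)
    labels = np.empty(len(usage), dtype=object)
    for c in range(n_chunks):
        idx = np.where(chunk_ids == c)[0]
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10, random_state=0)
        for i, fine in zip(idx, km.fit_predict(usage[idx])):
            labels[i] = (c, fine)
    return labels

# Two communication components; usage patterns distinguish instances within them.
comm = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
usage = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 0.9]])
labels = coarse_to_fine(comm, usage, k=1)
```

Restricting the fine clustering to within-chunk instances is what keeps the approach tractable at the scale of "massive instances": each k-means run sees only one chunk's worth of data.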
Corrigendum to: The TianQin project: current progress on science and technology
In the originally published version, this manuscript included an error in the indication of the corresponding author within the author list. This has now been corrected online to reflect the fact that author Jun Luo is the corresponding author of the article.
Reliability Improved Cooperative Communication over Wireless Sensor Networks
With the development of smart devices and connection technologies, Wireless Sensor Networks (WSNs) are becoming increasingly intelligent. New or special functions can be obtained by receiving new versions of program code to upgrade their software systems, forming the so-called smart Internet of Things (IoT). Due to the lossy nature of wireless channels, data collection in WSNs still suffers from long delays, high energy consumption, and frequent retransmissions. Thanks to wireless software-defined networks (WSDNs), software in sensors can now be updated to help them transmit data cooperatively, thereby achieving more reliable communication. In this paper, a Reliability Improved Cooperative Communication (RICC) data collection scheme is proposed to improve the reliability of random-network-coding-based cooperative communication in multi-hop relay WSNs without reducing the network lifetime. In WSNs, sensors in different positions can have different numbers of packets to handle, resulting in unbalanced energy consumption across the network. In particular, nodes in non-hotspot areas have up to 90% of their original energy remaining when the network dies. To efficiently use this residual energy, RICC adopts high data transmission power in non-hotspot areas to achieve higher reliability at the cost of greater energy consumption, and relatively low transmission power in hotspot areas to maintain a long network lifetime. Therefore, high reliability and a long network lifetime can be obtained simultaneously. The simulation results show that, compared with other schemes, RICC can reduce the end-to-end Message Fail delivering Ratio (MFR) by 59.4%–62.8% under the same lifetime, with more balanced energy utilization.
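The power-assignment rule at the heart of RICC, spending the surplus energy of non-hotspot nodes on higher transmit power while hotspot nodes (near the sink, relaying many packets) keep low power to preserve lifetime, can be sketched as a simple threshold on distance from the sink. The radius and power values here are entirely hypothetical placeholders.

```python
def assign_tx_power(hop_distance, hotspot_radius=3, p_low=0.5, p_high=1.0):
    """RICC-style power assignment (illustrative constants, not the paper's).
    Nodes within the hotspot region near the sink transmit at low power to
    preserve network lifetime; farther nodes, which die with most of their
    energy unused, transmit at high power for reliability."""
    return p_low if hop_distance <= hotspot_radius else p_high

powers = [assign_tx_power(h) for h in range(1, 7)]
```

The scheme trades otherwise-wasted residual energy for reliability, which is why both high reliability and a long lifetime can hold simultaneously.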
Heterogeneous Anomaly Detection for Software Systems via Attentive Multi-modal Learning
Prompt and accurate detection of system anomalies is essential to ensure the
reliability of software systems. Unlike manual efforts that exploit all
available run-time information, existing approaches usually leverage only a
single type of monitoring data (often logs or metrics) or fail to make
effective use of the joint information among multi-source data. Consequently,
many false predictions occur. To better understand the manifestations of system
anomalies, we conduct a comprehensive empirical study based on a large amount
of heterogeneous data, i.e., logs and metrics. Our study demonstrates that
system anomalies could manifest distinctly in different data types. Thus,
integrating heterogeneous data can help recover the complete picture of a
system's health status. In this context, we propose HADES, the first work to
effectively identify system anomalies based on heterogeneous data. Our approach
employs a hierarchical architecture to learn a global representation of the
system status by fusing log semantics and metric patterns. It captures
discriminative features and meaningful interactions from multi-modal data via a
novel cross-modal attention module, enabling accurate system anomaly detection.
We evaluate HADES extensively on large-scale simulated and industrial datasets.
The experimental results present the superiority of HADES in detecting system
anomalies on heterogeneous data. We release the code and the annotated dataset
for reproducibility and future research.